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Abstract 

We propose novel methods for predictive (sparse) PCA with spatially misaligned data. 

These methods identify principal component loading vectors that explain as much variability 
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in the observed data as possible, while also ensuring the corresponding principal component 
scores can be predicted accurately by means of spatial statistics at locations where air pol¬ 
lution measurements are not available. This will make it possible to identify important mix¬ 
tures of air pollutants and to quantify their health effects in cohort studies, where currently 
available methods cannot be used. We demonstrate the utility of predictive (sparse) PCA in 
simulated data and apply the approach to annual averages of particulate matter speciation 
data from national Environmental Protection Agency (EPA) regulatory monitors. 


Keywords: Air pollution, dimension reduction, principal component analysis, spatial mis¬ 
alignment, land-use regression, universal kriging. 


1. INTRODUCTION 

One of the hrst well-documented air pollution events was in Belgium in 1930. In the Meuse 
Valley incident a thick fog lasted for Eve days which led to hundreds of people suffering 


from respiratory symptoms and 60 deaths during the following days (Nemery et ah, 2001). 


Another major event was the famous London Fog in 1952 which led to an estimated excess 


death toll of over 4000 (Logan, 1953). More recently, the number of extra deaths after this 


event was re-estimated to be nearly 12,000. (Bell and Davis, 2001). 


Since these events, extensive research has been conducted on the health effects of am¬ 
bient air pollution exposure. A growing body of literature presents evidence of the adverse 


effects of long-term air pollution exposure on various health endpoints (Brook et al., 2007 


Pope III et ah, 2002 Samet et ah, 2000). However, many of these studies focus on a single 


pollutant. For example, during recent years, much attention has been devoted to study the 
role of particulate matter in air pollution, specihcally particulate matter less than 2.5 mm 
in aerodynamic diameter (PM 2 . 5 ) and its effects on health (Miller et ah, 2007; [Pope III and 


Dockery, 2006). 


It is known that ambient air pollution is a complex mixture that contains multiple pollu¬ 
tants. The health studies that focus on understanding effects of a single pollutant ignore the 
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fact that typically people are simultaneously exposed to two or more pollutants, and these 
can combine to either amplify or alleviate their overall effects on the health endpoints. The 
associations between the health endpoints and the single proxy pollutants found in these 
studies are, therefore, more likely to be the effect of the mixture (that may change by space 


and time), rather than the effect of the individual pollutant (Crouse et ah, 2010). For ex¬ 
ample, health effects of PM 2.5 could change depending on characteristics of the particles. 


including shape, solubility, pH, or chemical composition of the pollutant (Vedal et ah, 2012). 
Hence, in contrast to studies based on a single pollutant approach, investigating health effects 
of multi-pollutant exposure can be useful in epidemiological studies for two main reasons: (i) 
it may provide a better understanding of potential combined effects of individual pollutants 
and interactions between the pollutants and (ii) it may result in easier interpretation of these 
effects to help inform air quality management policy. 

Analysis of health effects of long-term average, spatially varying multi-pollutant air pol¬ 
lution exposure using cohort data has two main challenges: (i) dimensionality of the multi¬ 
pollutant data and (ii) the spatial misalignment of the monitoring data and cohort study 
participants. The hrst challenge is fairly common in any analysis of high dimensional data 
when there are many independent variables of interest. In the context of understanding 
health effects of exposure to multiple pollutants, dimensionality can present difficulties be¬ 
cause of the fact these pollutants could potentially be highly correlated. Subsequently, this 
means that estimating and interpreting parameters of interest in health models to reveal the 
association between the health endpoint and the predictors could be non-trivial. 

The second challenge arises because multi-pollutant exposures are not always available 
at all subject locations in a cohort study. This challenge is obviously not unique to multi¬ 
pollutant exposure studies. For example, in epidemiological studies of health effects of a 
single pollutant, the common framework to deal with this issue is to use prediction models 
to assign exposures to study participants based on monitoring data available at different 
locations. To investigate the effects of multi-pollutant exposure, this framework requires 
building multiple prediction models and assigning exposures to each pollutant at all study 
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participant locations. This means that if one of these prediction models prodnces inaccnrate 
predictions, the resnlts of the health analysis may become nnreliable. 

While there are methods to resolve the issnes with dimensionality and spatial misalign¬ 
ment in varions contexts, these methods do not deal with these challenges simnltaneously. In 
order to nnderstand health effects of mnlti-pollntant exposnres and health endpoints, these 
challenges need to be solved together. In this paper, we propose an approach for dimension 
reduction that can be applied to mnlti-pollntant spatial data that resolves these issnes in a 
compntationally fast nnihed approach. Onr approach seeks to find sparse principal compo¬ 
nent scores that explain a large proportion of the variance in the data while also ensnring 
that mixtnres derived from these components are predictable at health endpoint locations. 
Predictions of the lower dimensional component scores can then be effectively nsed in a 
health effect analysis. We show that the new approach is preferable to the seqnential two- 
step approach of dimension rednction followed by spatial prediction, which may resnlt in 
principal component scores that are difficnlt to predict at nnmeasnred locations. We apply 
onr approach to national mnlti-pollntant air pollntion data from the U.S. Environmental 
Protection Agency. 

Two widely nsed models for predicting exposnres at snbject locations are based on land- 


nse regression (LUR) with Geographic Information System (GIS) covariates (Braner et al. 


2003 Hoek et al., 2008) and nniversal kriging (UK) (Jerrett et ah , 2005 Kiinzli et al., 2005). 


In addition to nsing GIS covariates to constrnct the mean strnctnre of the model as in 
land-nse regression, UK also allows for spatial dependence by modeling correlations between 
residnals at different locations ( |Kim et ah 2009 Mercer et ah, 2011 Sampson et al., 2013). 
Since this nsnally results in more accurate predictions of pollutants, we use UK throughout 
the paper. 

The rest of the paper is organized as follows. Section [^describes the available monitoring 
multi-pollutant data and geographic covariates. Section describes in detail the traditional 


and predictive principal component analysis approaches. Here, in Section [3721 we also briefly 
give an overview of universal kriging that is used in predicting pollutants and PG scores at 
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new locations. In Section we present results from our simulation analysis to demonstrate 
key differences between different methods. In Section]^ we describe the application of PCA 
approaches to the monitoring data. Finally, in Section we summarize our methods and 
results and discuss our statistical approach and conclusions. 


2. DATA 

2.1 Monitoring data 

Air pollution data are collected nationally by a network of monitors called the Air Quality 
System (AQS). Most of these data are accessible at the Environmental Protection Agency’s 
(EPA) website at http://www.epa.gov/ttn/airs/airsaqs/detaildata/downloadaqsdata.htm, While 
measurements of PM 2.5 are available at almost all AQS monitor locations, detailed data on 
other pollutants and PM 2,5 components is only collected at two sub-networks of the AQS 
network. One of the sub-networks of the AQS with multi-pollutant data is called the Chemi¬ 


cal Speciation Network (CSN) (EPA, 2009). CSN monitoring locations are generally selected 
to assess population multi-pollutant exposures, and therefore are mostly located in urban 
areas. Another sub-network of the AQS with multi-pollutant monitoring data is called the 


Interagency Monitoring of Protected Visual Environments (IMPROVE) network (Eldred 


et ah, 1988). IMPROVE monitoring locations are located in national parks and some rural 


areas. Data collected by the IMPROVE network are available at the Visibility Informa¬ 
tion Exchange Web System (VIEWS), at http://views.cira.colostate.edu/web/, IMPROVE 
monitoring locations are selected to assess pollution impacts in sparsely populated environ¬ 
ments. Therefore, most of these monitors do not represent exposure levels for (the majority 
of) subjects living in populous areas. 

The AQS monitors are located throughout the United States. However, as funding avail¬ 
ability and monitoring aims evolve, some new monitors become available or/and some old 
ones are discontinued, and monitoring schedules are changed. For example, the monitors at 
a given location that collected PM2.5 data from 2000 to 2002 may not have collected PM2.5 
data from 2003 to 2005. Furthermore, an individual monitor that collected PM 2,5 data every 
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day in 2002 may have collected data every third day in 2003. Another monitor that collects 
PM 2.5 data every day during the winter may only collect data every third day in the summer. 
Since these features complicate the analysis of the data, we apply the methods described in 
this paper to annual averages data for 2010. In 2010, the data from the AQS monitors were 
at a daily resolution. These data are converted to annual averages using completeness rules 
making sure that each monitor had at least 10 data points per quarter and a maximum of 
45 days between measurements. Subsequently, before the analysis, the annual averages were 
square-root transformed. 

Initially, we had 7375 monitors with incomplete data for 28 pollutants: PM 2 , 5 , PMio, 


NO 2 , NO^, SO 2 , O 3 , EC, OC, SO 4 , NO 3 , Al, As, Br, Cd, Ca, Cr, Cu, Co, Fe, K, Mn, Na, S, 


Si, Se, Ni, V and Zn. Since we needed complete multi-pollutant data for our analysis, our 
hrst goal was to obtain a list of pollutants with a large number of monitors without missing 
measurements for all pollutants in the list. After cleaning the original data, our hnal list 
included 19 pollutants: PM 2 . 5 , EC, OC, Al, As, Br, Ca, Cr, Cu, Fe, K, Mn, Na, Ni, S, Si, V 
and Zn. For these pollutants, the number of monitors with full data was equal to 284. The 
pollutants removed from the analysis were PMio, NO 2 , NOa,, SO 2 , O 3 , SO 4 , NO 3 , Cd, and 
Co. We note that adding any of the pollutants PMio, NO 2 , NOj,, SO 2 or O 3 to our hnal list 
reduced the number of monitors from 284 to a number between 44 to 55, while adding any 
of SO 4 , NO 3 , Cd, or Co reduced the number of monitors to 123. We also note that S is a 
good surrogate for SO 4 , and Cd and Co are not particularly good tracers of major sources 
believed to be scientihcally important, while NO 3 has sampling artifact issues associated 
with its volatility. 

The measurements for the pollutants from the hnal list were from CSN (130 monitors) and 
IMPROVE (154 monitors) networks. The data from these networks for these pollutants were 
reported in units of mg/m^. Additionally, we note that in 2010 the CSN and IMPROVE 
networks collected data using similar methods. For example, for PM 2 . 5 , both networks 
used Federal Reference Methods (FRM) to collect data. IMPROVE data and AQS FRM 
data collected concurrently at the same locations were comparable enough to be considered 
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equivalent. Table [T] shows the list of retained pollutants with related summary statistics. 
Here, based on R'^s (dehned as = max(0,1 — where Xi is the observed, Xi is 

the predicted values of the pollutants) and mean squared errors (MSEs, dehned as MSE = 
'j obtained using cross-validation and universal kriging (reviewed in detail in Section 


(3.2)), we see that while some pollutants are highly predictable (e.g. PM 2.5 or S have high 


R^), some pollutants cannot be predicted well (e.g. Mn). We note here that predictions used 
in calculations of R^s and MSEs are obtained from single-pollutant universal kriging models 
to show descriptively how individually predictive the pollutants are. 

2.2 Geographic covariates 


The description below closely follows Bergen et ah (2013). For all monitor and subject 


locations, we obtained approximately 600 GIS covariates. These data were from various 
external sources, such as TeleAtlas, the US Gensus Bureau, and US geological survey (USGS). 
The list of GIS covariates includes distances to Al, A 2 , and A3 roads (Gensus Feature Glass 
Godes (GFGG)); land use within a given buffer size; population density within a given buffer; 
and normalized difference vegetation index (NDVI) which measures the level of vegetation in 
a monitor’s vicinity. Here, GFGG Al roads are limited access highways; A 2 and A3 roads are 


other major roads such as county and state highways without limited access (Mercer et al. 


2011). For NDVI, hrst, a series of 23 16-day composite satellite images from the year 2006 


were obtained. Then, the index was converted by the University of Maryland from the -1 to 
1 scale to the 0-255 pixel brightness scale. On this scale, water has a value of approximately 
50 and areas with dense vegetation have values around 200. For each location of interest, for 
each image, all pixels with a centroid within a certain distance of the location were averaged 
(radii included 250m, 500m, 1km, and 5km). For each buffer size, hve summary numbers 
were calculated from the series of 23 averages for each location: the 25th, median, and 75th 
percentile of the entire year’s series, the median of the expected ‘high vegetation’ season, 
dehned as April 1 - September 30, and the median of the expected ‘low vegetation’ season, 
dehned as the rest of the year. 
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Table 1: Summary of the data for the retained pollutants from 284 monitors. All the 


pollutants are in nnits of mg/rn? 


Pollntants 

Mean 

SD 


MSE 

PM 2.5 

7.351 

3.814 

0.933 

0.00156 

EC 

0.417 

0.336 

0.866 

0.00299 

OC 

1.533 

0.875 

0.877 

0.00729 

A1 

0.050 

0.036 

0.582 

0.01484 

As 

0.001 

0.001 

0.817 

0.00057 

Br 

0.002 

0.002 

0.730 

0.00450 

Ca 

0.052 

0.046 

0.513 

0.00001 

Cr 

0.001 

0.003 

0.497 

0.00013 

Cn 

0.003 

0.004 

0.555 

0.00689 

Fe 

0.065 

0.068 

0.545 

0.00039 

K 

0.051 

0.027 

0.668 

0.00009 

Mn 

0.002 

0.006 

0.269 

0.00243 

Na 

0.091 

0.114 

0.531 

0.01473 

S 

0.566 

0.352 

0.965 

0.00004 

Si 

0.135 

0.092 

0.577 

0.00477 

Se 

0.001 

0.002 

0.525 

0.00086 

Ni 

0.001 

0.001 

0.649 

0.00283 

V 

0.001 

0.001 

0.783 

0.00282 

Zn 

0.008 

0.011 

0.717 

0.00010 



Before further analysis, all GIS covariates are pre-processed to remove uninformative 
and unreliable variables with very small variability or outliers: we eliminated variables with 
> 85% identical values, and those with the most extreme standardized outliers > 7. We 
then log-transformed and truncated all distance variables at 10 km and computed additional 
“compiled” distance variables such as minimum distance to major roads, distance to any 
port, etc. All selected covariates are then mean-centered and scaled by their respective 
standard deviations. 

Additionally, we note that due to a large number of GIS covariates (some of which are 
possibly correlated), using all GIS variables directly in LUR or UK models is not practical. 
A commonly used solution is to build a model by hrst reducing the dimensionality of the 
GIS covariates into a smaller set of variables. These extracted variables are then used to 
construct the mean structure in the LUR and UK models. Various dimension reduction 
and variable selection algorithms have been proposed to deal with dimensionality of GIS 
covariates. These algorithms include exhaustive search, stepwise selection, and shrinkage by 


the lasso (cf. Mercer et ah, 2011 Tibshirani, 1996). Since most of these variable selection 
methods may be computationally expensive, an alternative dimension reduction approach 
is to apply principal component analysis to the GIS data. Similar to analogous approaches 


based on partial least squares (PLS) (Abdi, 2003; Sampson et ah, 2011), PGA allows us to 
reduce the number of original GIS covariates by using linear transformations into a smaller 
set of variables that explain most of the variability in the GIS data (see Section [^for more on 
PGA). The PGA scores obtained from GIS covariates can then be used to construct the mean 
structure in a LUR or UK model instead of using all individual GIS covariates. Therefore, 
instead of using all GIS covariates, we only use PGA scores obtained from GIS data. To 
avoid confusion, throughout the paper, we refer to these lower dimensional variables simply 
as GIS covariates. We also note that all methods in this paper can be applied with the full 
set of pre-processed GIS covariates as well. 
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3. METHODS 


3.1 Review of sparse PGA for exposure dimension reduction 

We begin by reviewing dimension reduction methods (specifically principal component anal¬ 
ysis (PGA) and sparse PGA) and their application to cohort studies when the exposure data 
are not spatially misaligned. We introduce notation along the way. 

Suppose for each subject i = in a cohort study, we observe a scalar health 

endpoint i/i and a p-dimensional exposure vector Xj. In linear regression, the most straight¬ 
forward approach to assessing associations between i/i and Xj is to estimate the parameter 
vector /3 in 

Y = Po + X(3 + e, 


possibly also including interactions. Here, Y = {yi ,..., is the vector of endpoints and X 
is the nxp exposure matrix with rows Xj. In settings when the number of monitors/subjects 
is limited compared to the number of exposure covariates and/or when there is a complex 
dependence between the covariates, it can be benehcial to reduce the dimensionality of the 
exposures. 

A dimension reduction method identihes low-dimensional structure in the exposure data 
X and given this structure provides a mapping from a full exposure vector to the corre¬ 
sponding low-dimensional representation. PGA is a widely used technique that can extract 


a small number of important variables from high-dimensional data (Jolliffe , |1986 ). In PGA, 
the low-dimensional structure is apx A; matrix V comprised of k representative p-dimensional 
exposure vectors and the low-dimensional representation is a n x A; matrix U com¬ 

prised of n subject-specihc A;-dimensional exposure vectors such that X ~ UV'^ for k < p 
and X = UV'^ for k = p. Using this representation, we can estimate (3 in 


Y =~Po+U~P + ~e. 

For k p, the regression coefficients in (3 are generally easier to estimate and (may be 
easier to) interpret than those in (3. We clarify that in the original regression problem, the 
the exposure variables could be highly correlated. This means that the components of (3 
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corresponding to each exposure variable are hard to interpret in a traditional sense as the 
amount of change in the endpoint variable y per unit increase of the exposure variable when 
all the other exposures are kept constant. In the exposure data, some pollutants always occur 
and vary together making it unrealistic to assume that pollutants can be kept constant when 
we change the values of other pollutants. On the other hand, the coefficients (3 are lower 
dimensional than f3 and they correspond to mixtures that are usually almost uncorrelated. 
This implies that the components of (3 can be interpreted as the change in the endpoint per 
one unit increase of the mixture with other principal mixtures hxed. 

We focus on unsupervised dimension reduction methods that are not informed by the 
association between Y and X. An alternative is supervised dimension reduction, which 
may sometimes identify representations that are more strongly indicative of the relationship 
between Y and X. However, we consider it promising to work with unsupervised methods, 
as the resulting low-dimensional representations have the potential of being important for 
the multiple health endpoints one faces in many epidemiology studies. 

If we apply PCA to A, we hnd orthonormal direction or loading vectors vi,..., Vp and 
corresponding principal component (PC) scores ui = Avi,..., Up = Avp. Here, dimension 
reduction using PCA is then accomplished by using the hrst k < p PC scores to represent 
the data, with k chosen such that these few scores and their corresponding loadings explain 
most of the variation in the data X. 

In traditional PCA, all entries in the loading vectors are non-zero, which can make it 
difficult to interpret individual PC scores. Several sparse PCA methods have been proposed, 
with the goal of modifying PCA so that the loading vectors contain many zeros in order to 
improve intepretability. We briefly review an approach proposed by Shen and Huang (2008), 
which exploits the connection between PCA and low rank approximation of matrices. For 
k < p, let us dehne 

= ( 1 ) 

1=1 

where ui and vi are the hrst k PC score and loading vectors. It can be shown that X^^'> is 


Shen and Huang (2008 
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the closest rank-k matrix approximation to X under the Frobenius norm. For two matrixes 
X and X, the Frobenius norm is dehned as ||X — X\\\ := tr{(X — X){X — X)"''}). This 
implies that one can conduct PCA by solving a series of optimization problems. Suppose, 
for example, that our goal is to hnd the best rank-1 approximation to the matrix X. Since 
any n x p rank-1 matrix can be written as uv"'' with n-dimensional unit vector u and 
p-dimensional vector v, we can hnd the best rank-1 approximation to X by solving the 
optimization problem 

min2,v||A — uv"^!! (2) 


with respect to u and v, under the constraint ||u|| = 1. We then set vi = v/||v|| and 
Ui = Xvi to obtain the hrst PC loading and score vectors. Notice that we constrain 
||u|| = 1 in the optimization, as this turns out to be helpful in development of sparse PCA 


(Shen and Huang, 2008). We hnd subsequent pairs (u/,Vi), / > 1 by solving corresponding 
one rank approximation problems with residual matrices obtained at the previous step. 


As described in Shen and Huang (2008), an interpretable approach to enforce sparsity in 
the loadings v can be developed via low rank approximations. To explain the algorithm, we 
focus on dehning and calculating the hrst sparse PC score and its loading vector. We obtain 
subsequent PC scores and their loadings by simply considering the residual matrices of the 
sequential matrix approximations. The main idea is to add a regularization penalty on the 
loadings via Equation ([^. In other words, to hnd the hrst sparse PC loadings and scores, 
we solve the minimization problem. 


mins^vll-^ - uv’^ll -h Pa(v), 


(3) 


with respect to u and v, under the constraint ||u|| = 1. Here, P\iy) = MVil a Li (lasso) 
penalty function with a penalty parameter A > 0. This problem can easily be solved by 


a fast alternating algorithm; see Shen and Huang (2008) for details. Similar to the above, 
after u and v are obtained, the hrst PC score is dehned by ui = Xvi (where the loading 
vector is again vi = v/||v||). After iterating this procedure for the hrst k PCs, we dehne 
our reduced dimensional exposure vector at monitor locations as U = [ui,... ,Uk]. In the 
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following, we refer to this approach as traditional sparse PCA. 

We note that the penalty parameter A > 0 controls the degree of sparsity in the loadings. 
We will discuss various options for selecting the penalty parameter in later sections, when 
we describe our new predictive sparse PCA algorithm. If A = 0 this algorithm recovers the 
standard PC loadings and scores. 

3.2 Review of universal kriging for predicting PC scores 

We can apply (sparse) PCA as described above to reduce the dimensionality of spatially 
misaligned exposure monitoring data. However, our ultimate goal is to predict the PC 
scores at subject locations where exposure data are not available. A straightforward way 
to solve this problem is to apply (sparse) PCA without any modihcation to the monitoring 
data and then use spatial statistics to predict the PC scores at subject locations. We review 
universal kriging (UK) in this context. 

First, let u = {m(s)} be a PC score (say, the hrst PC score), where s E E, and E is a set 
of geographical coordinates. If s is a monitor location, u{s) can be calculated. Our goal is 
to predict u{s*) at subject locations s*. We model u as 

m(s) = Z (s)q: + tc(s), for s E E, (4) 

where ^(s) is a set of covariates associated with each location s and ck is a vector of coef- 
hcients. Traditionally, in UK, we only use the CIS covariates. Spatial dependence is then 
incorporated by modeling {tc(s) : s G E} as a zero mean stationary Gaussian process: if 
w = (t(;(si),..., t(;(s„))''^ and ^ are the parameters of the model, then 

w|^~iv(o,r(^)), (5) 

where A'(^) is a symmetric and positive dehnite covariance matrix of w. One commonly used 
covariance function is the ‘exponential’ covariance function with parameters ^ 

(with > 0), which has the form A(^)y = = j) + fi;exp(— ||sj — ||/0), where X is 

the indicator function and ||sj — Sj|| is the Euclidean distance between locations Sj,Sj G E. 
This model is interpreted in the following manner; ip (often called the nugget effect) is the 


13 



non-spatial error associated with each location, and k (called the sill) and 0 (called the 
range) are parameters that define the scale and range of the spatial dependence respectively. 
We note that thronghont the paper, we nse the ‘exponential’ covariance fnnction to model 
the covariance strnctnre of the process. We also note that while there are more flexible 
covariance fnnctions (e.g. Matern fnnction) that can be used to model the covariance matrix 
here, we choose to use an exponential covariance function as it know to work well with air 


pollution data from EPA monitors (cf. Bergen et ah , 2013 Sampson et ah, 2013). Moreover, 
we believe that the result and conclusions on this manuscript should be robust with regards 
to choosing the form of the covariance function. 

If u is the vector of observed PC scores modeled as a Gaussian process, our goal is to 
obtain predictions of the score, u* at new locations in E. Since u is a Gaussian process, 
the joint distribution of u and u* given ^ and ck, the covariance and mean parameters, 
respectively, can be written as: 


u 

u* 


cx ^ N 


hi 

h2 


^11 ^12 
^21 ^22 


( 6 ) 


where and /X 2 are the means of u and u* respectively and ^ 11 ,^ 12 ,^ 21,^22 are block 
partitions of the covariance matrix 2^(^), given $,,(x. /ii and fi 2 are functions of ck and 
Ell, Ei 2 , E 21 , and E 22 are functions of Gonsequently, using multivariate normal theory 


(Anderson, 2003), u* | u, ck,^ is normally distributed with the mean and covariance 

E(u*|u,q:,^) = p2 + - hi) 

Var(u*|u,Q:,^) = 


(7) 


^21 ^11 ^12- 


To obtain predictions at new locations, we first obtain estimates of (^, ck), (^, o:), via maxi¬ 
mization of the multivariate normal likelihood function 7r(u|^, ck). Then, we use the formula 
for E(u*|u, CK,^) in Equation ([^ as predictions of u*. 

3.3 Predictive sparse PGA 


The two stage procedure outlined in Sections 3.1 and 3.2 can be described as (sparse) PGA 
followed by spatial prediction. The main drawback to this procedure is that there is no 
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mechanism in the hrst stage to ensure that we hnd scores that can be accurately predicted 
by a spatial model in the second stage. Suppose there is a column of X that accounts 
for a large portion of the exposure variability but does not have much spatial structure. 
With complete data, we would weight this column highly in V, but with misaligned data 
this would result in one or more columns of the score matrix U being poorly predicted at 
subject locations. This happens because the traditional PCA does not know that our multi¬ 
pollutant data is spatial data. Therefore, while the traditional (sparse) PCA does give more 
weight to pollutants with higher variability, it does not distinguish variability from spatial 
variability. It can be observed that some pollutants could be highly variable, while also being 
well informed by geographical covariates or splines. These pollutants can be predicted well 
by geostatistical model. At the same time, other pollutants could be highly variable and 
spatially unpredictable. This implies traditional (sparse) PCA may not necessarily result in 
scores that have better or worse predictability compared to individual predictability of the 


pollutants. In this section, we extend the sparse PCA approach described in Shen and Huang 


(2008) to resolve this issue by adding a constraint to Equation ([^ to force the PCs to be 
predictable from the available data. This new algorithm is termed predictive (sparse) PCA. 
We develop an alternating minimization algorithm to solve the corresponding optimization 
problem. In contrast to traditional (sparse) PCA, in predictive (sparse) PCA, we inform the 
algorithm that the data is spatial by including the CIS variables and spline terms. Thus, 
in the new PCA, we are reducing (or eliminating in some cases) the contribution of the 
pollutants that are highly variable, but spatially unpredictable. 

Let Z (s) be a set of geographic covariates and thin-plate spline basis functions calculated 
at each location s. We will use Z{s) to guide our selection of PC loadings to ensure that 


they are spatially predictable. As we noted in Section 3.2, in UK, Z{s) does not include 
splines. We include thin-plate spline basis functions in Z(s) here so that our predictive 
(sparse) PCA algorithm incorporates both geographic covariates and spatial smoothing in 
the prediction. It is known that there exists a link between interpolation using splines and 


kriging (cf. Dubrule, 1984 Hutchinson and Gessler, 1994 Matheron, 1981). One can think of 
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spline interpolation here as a low-rank kriging with hxed covariance and degree of polynomial 
trend to approximate the spatial variability. In other words, the spline basis fnnctions stand 
in for the kriging component in identifying loading vectors with predictable scores. When it 
comes time to optimally predict scores at snbject locations we revert back to UK with only 
GIS covariates in the mean model. In our approach, we use rank 10 thin plate splines. This 


is comparable to using 10 knots in conventional knot-based splines interpolation (cf. Wood 


2003). We select 10 because it was shown to work well in the literature for air pollution data 


(cf. Bergen et al., 2013). In general, however, the rank of thin plate splines can be chosen 


by hypothesis testing approaches based on generalized cross-validation. Mallow CP criteria. 


maximum likelihood-based approaches. The readers are referred to Wood (2003) for more 


details on various methods to choose the rank of the thin-plate splines. 

Let Z be the n x m-dimensional matrix of the available covariates and spline terms 
calculated at each monitor location. In order to hnd the hrst principal components of the 
multi-pollutant data, we propose to optimize 


||X-(Za/||Za||)v'||^ + P,(v) 


( 8 ) 


with respect to ol and v. Here, we use the Li penalty function P^O again. Note that this 
optimization problem is similar to the one in Equation (|^. The difference is that u is now 
constrained to be equal to Zdt/I |Zq:| | to force better predictability of the principal component 
scores. The reason to normalize Z6l is two-fold. First, this increases identihability of the 
parameters ol and v. Second, it allows us to construct a fast algorithm to hnd a solution (see 
subsections below). After optimal ol and v for the hrst PC are obtained, since {Z6l/\\Z6l\\)\'^ 
explains some part of X, we dehne the residual matrix by 

X = X - {Z6l/\\Z6l\\)w'^ . (9) 


This matrix is then used to hnd the second PC score and its loadings. The subsequent pairs 
of component scores and loading vectors can be found by considering analogously dehned 
residual matrices using the ol and v from the previous step. 
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To solve the optimization problems in Equation (1^ with respect to a and v, following 


ideas in Shen and Huang (2008), we use an iterative algorithm described below. We first 


consider the following lemmas. 

Lemma 1. For a fixed v, the ct that minimizes Equation 


has the form 


a = (z^zy^z^w, 


where W = Xv/v''"v. 

Proof. When v is fixed, it is easy to see that minimizing Equation (|^ is equivalent to 
minimizing 

\\W - Za/\\za\\\\l. 

which is equivalent to optimizing the following 

||w- Zaii^s.t. \\zay = 1 . 


Hence, applying the method of Lagrange multipliers to the latter, we obtain 

a = {z^zy^z^w. 

Lemma 2. For a fixed ci, the optimal v that minimizes Equation ([^ is obtained at 

^ = hyiCza/\\za\\), 


where hxiy) = sign(||/| - A)+. 

, Equation (|^ 

can be minimized component-wise for each j by minimizing the following 


Proof. First, let w = Za/\\Za\\. Then, following Shen and Huang (2008 


E 


Xi 


2{X + V.+Pa(v) 


Since Pa(v) = 2A ^ |vj|, the proof of the lemma then follows from simple calculations. 

3 

We now propose the following algorithm to solve our optimization problem: 


1. We hrst choose initial values for cx. and v. 


17 








2. For a fixed v, we update a. using Lemma 1. 

3. For a fixed ci, we update v using Lemma 2. 

4. We repeat Steps 2 and 3 until convergence. 


Suppose we are interested in the first k PCs. Denote the parameters estimated by solving 
Equation ([^ k times by cki, ... ,a.k and vi,..., v^. Denote the corresponding residual ma¬ 
trices by Xi ,..., Xfc, where Xi = X, X 2 = Xi — /WZa^ | |)v7, etc. For I = 1,..., k we 
can now define the 1-th PC loadings by vi = v;/||v;||. Then, we dehne PC scores as follows: 


Vi Vo 

Ui = = XVi, U2 = = XV2 .... 


|Vl| 


IWI 


( 10 ) 


As in Section [3i^ we use UK to predict the PC scores at subject locations without exposure 
data, but we now expect prediction accuracy to be improved since the loadings were selected 
to ensure predictability of the scores. Here, using the CIS covariates in predictive PCA and 
in UK may seem to raise suspicion with overfitting. To clarify that this is not the case, we 
note that the CIS covariates are not used to define the principal scores in Equation ( [Io| . We 
use the CIS covariates only to guide the selection of the loadings in the predictive (sparse) 
PCA. Additionally, to ensure that we are not overfitting the data, we validate our approach 
using out of sample 10-fold cross-validation. 


3.4 Selection of penalty parameters 

We describe two methods to select penalty parameters in sparse PCA. Recall that in both 
traditional and predictive approaches, the penalty parameters. A, control the sparsity of the 
loading vectors for each PC. Since PCA algorithms are sequential, it is easy to see that 
different A can be used for different PCs. 

The second approach is to maximize prediction accuracy of the original pollutants by 
minimizing the out-of-sample Frobenius distance ||X — UV~^\\f with respect to A for each 
principal component. Here, U is defined as the matrix of predictions of PC scores at monitor 
locations obtained via 10-fold cross-validation using UK. We use the term out-of-sample” 
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because these predictions are derived from applying PCA (predictive or traditional) followed 
by UK to each of the 10 training sets hrst and predicting the scores at the left out test 
locations. Therefore, minimizing this distance is equivalent to approximating X via PC 
score predictions, which requires that PC scores be predictable and that the loading vectors 
are not highly correlated. 


4. SIMULATION STUDIES 
4.1 Simulated Exposure Analysis 

In this section, we apply the methods discussed in Section]^ to simulated data. Our goal is 
to understand the main differences between the traditional and predictive versions of PCA 
and sparse PCA. In this and the following sections, we focus on the hrst three principal 
components. We use the proportion of variance accounted for using the principal scores 
and the consideration of simplicity in selecting the number of components in the analysis. 
With three principal scores, we are able to explain about 70-80% of the original variance 
in the simulated and real datasets in Section The threshold values of 70-80% are often 


used as a rule of thumb in the PCA literature (cf. Kim and Mueller, 1978 Miller, p998 ). In 
our manuscript, three is also small enough to be useful in presenting the material without 
creating excessively large tables. We note that using the amount of explained variance in the 
data with lower dimensional scores is not the only way to select the number of components in 
the analysis. For example, other commonly used methods include approaches based Kaiser 


criterion (Kaiser, 1960) and the scree tests (Cattell, 1966). These and various other statistical 
and non-statistical methods and statistical test to hnd optimal number of components are 


discussed in detail in Bryant and Yarnold (1995); Peres-Neto et ah (2005) and Gorsuch 


(1983). 


We hrst randomly choose 400 locations from the list of 970 monitor locations. At these 
locations, we have access to all CIS covariates. Using the notation from Section 


3.3 


let Z 


be the matrix of the GIS covariates and thin-plate splines obtained using the coordinates 
of the selected locations. Using Z, we generate 19-dimensional multi-pollutant data. A, by 
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simulating its columns X [, j], using linear models of the form 


= Z-fj + ej, 

where j = with 7 j fixed coefficient vectors and i.i.d. noise vectors ej distributed 

according to a multivariate normal distribution with mean 0 and covariance matrix ajl. 

We choose the simulation parameters jj and aj so that some pollutants depend only- 
on GIS covariates; some pollutants depend only on thin-plate splines and some pollutants 
are just noise. In addition, we also generate some pollutants using a combination of GIS 
covariates and splines. As can be seen from the model above, all of this is easily achieved 
by making some of the 7 /s close to zero. Finally, we also introduce correlation between 
pollutants by forcing some of the y/s to be close to each other for different pollutants in 
order to make the simulated data reflect some of the correlation structure observed in the 
actual monitoring data. 

Additionally, we consider two different scenarios. Under the hrst scenario, we choose 
the parameters in the data generating model so that most of the simulated pollutants are 
predictable using UK. Under the second scenario, we make sure that most of the pollutants 
cannot be predicted well using a spatial model. We note that this can be achieved by varying 
the values of the coefficients 7 ^ and the variances of the error terms aj: increasing the values 
of 7 j corresponding to the GIS covariates increases the predictability of the pollutants using 
the geographical features of the locations, which in turn increases the predictability using UK 
since GIS variables are a part of the kriging model; increasing the values of 'jj corresponding 
thin-plate splines results in smoother pollutant surfaces, which increases predictability using 
UK since kriging captures spatial variability; and hnally, adding random noise with larger 
variances aj decreases predictability by making it hard to detect the true signal. 

At each simulation, we have 400 locations with multi-pollutant data. In assessing the 
performance of different models, we split the simulated data into two parts with 300 and 
100 locations. We apply the methods to data from 300 locations. Then, we predict the 
pollutants and scores at the remaining 100 location that were not used in building the 
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Table 2: Summary: Simulated data with pollutants with high predictability. The is ob¬ 
tained using UK. Absolute correlation is dehned as the average of three absolute correlations 
between the scores. Sparseness is the fraction of zero-valued loadings. Here, “Max.Scores” 
means that the penalty was selected to maximize the predictability of the principal scores; 
and “Max.Pollntants” means that the penalty was selected to maximize the predictability 
of the pollntants 

Predictability of 

Abs. Correlation 

Scores (R^) _ Sparseness (%) 

(Average) 

PCI PC2 PC3 




o 


Trad. 

PCA 

0.96 

0.87 

0.70 

0.04 

0.00% 


















s 

0) 

Plh 

Pred. 

PCA 

0.96 

0.88 

0.71 

0.05 

0.00% 


13 

0) 

Max. 

Scores 

Trad. 

Pred. 

PCA 

PCA 

0.97 

0.97 

0.91 

0.92 

0.84 

0.84 

0.38 

0.43 

35.22% 

35.53% 
























Trad. 

PCA 

0.96 

0.87 

0.66 

0.11 

19.96% 












03 

(Ih 

Pred. 

PCA 

0.96 

0.87 

0.74 

0.18 

20.04% 
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Table 3: Summary: Simulated data with pollutants with low predictability. The is ob¬ 
tained using UK. Absolute correlation is dehned as the average of three absolute correlations 
between the scores. Sparseness is the fraction of zero-valued loadings. Here, “Max.Scores” 
means that the penalty was selected to maximize the predictability of the principal scores; 
and “Max.Pollntants” means that the penalty was selected to maximize the predictability 
of the pollntants 

Predictability of 

Abs. Correlation 

Scores {R^) _ Sparseness (%) 

(Average) 

PCI PC2 PC3 




o 


Trad. 

PCA 

0.76 

0.60 

0.27 

0.02 

0.00% 


















s 

0) 

Plh 

Pred. 

PCA 

0.85 

0.76 

0.56 

0.08 

0.00% 


13 

0) 

Max. 

Scores 

Trad. 

Pred. 

PCA 

PCA 

0.93 

0.95 

0.79 

0.90 

0.28 

0.83 

0.31 

0.42 

52.47% 

69.96% 
























Trad. 

PCA 

0.79 

0.66 

0.28 

0.08 

19.14% 












03 

(Ih 

Pred. 

PCA 

0.88 

0.80 

0.61 

0.13 

25.69% 


22 



model. We compare these predictions to known valnes of the pollntants and scores. We note 
here that the known scores are dehned by projecting the observed valnes of pollntants at 
100 locations to hxed loadings obtained from applying PCA to data from 300 monitors. 

Tables 1^ and snmmarize resnlts from the simnlation stndy nnder different scenarios. In 
these tables, simnlated examples are rnn 100 times and all snmmaries are based on averages. 
The ordering of principal component scores is from the highest predictable to the lowest 
predictable score in terms of R^. In other words, at each simnlation, the best predictable 
PC score is called the hrst score, the second best predictable score is called the second score 
etc. Here, we dehne the absolnte correlation as an average of three absolnte correlations 
between all three scores, and sparseness is the fraction of loadings eqnal to zero. Here, since 
correlations with different signs can resnlt in an average close to zero, looking at the average 
of absolnte correlations is a more informative snmmary of dependence in scores. 

Table presents the hrst simnlation scenario where all pollntants are highly predictable 
nsing a spatial model. We see that traditional and predictive PCA perform similarly when 
there is no penalty and when the sparseness penalty parameters are chosen to maximize 
predictability of the scores. However, when the penalty parameters are chosen to maximize 
predictability of the pollntants, we see that the predictive method increases the for the 
third principal score compared to traditional sparse PCA. 

Table presents the second simnlation scenario where only some pollntants are highly 
predictable nsing a spatial model. We expect to see more of a beneht from predictive 
(sparse) PCA in this setting, and indeed the advantage is more prononnced regardless of 
how the penalty is chosen. With no penalty, predictive PCA increases predictability of 
the PC scores withont increasing the correlations between the scores. When the penalty 
is chosen to maximize predictability of scores, the average correlation between the scores 
tends to be nndesirably high. When the penalty is selected to maximize predictability of 
pollntants, both the traditional and predictive sparse PCA methods resnlt in relatively small 
correlations between its scores, bnt prediction accnracy is better for predictive sparse PCA. 
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4.2 Simulated Health Analysis 

While it is possible and straightforward to directly apply the methods described in this 
manuscript to health data from cohort studies to investigate the effects of multi-pollutant 
mixtures on health endpoints, to obtain reliable health effects estimates, the health analysis 


should also address additional challenges like accounting for measurement error (Szpiro and 


Paciorek, 2013), effects of preferential sampling of monitoring locations (Lee et al., 2015), 


and possible effect modifiers in the health model (Analitis et al., 2014 Delfino et al., 2014 


Park et al., 2008). We do not aim to present the results of an analysis of real cohort studies 


here, which is a topic for future work. Instead, in this section, we provide a simulated 
health analysis example of the effects of exposure to multiple pollutants using traditional 
and predictive sparse PCA. 


We follow the idea laid out in Section 4.1 and assign Ni = 300 locations as monitors 
locations and N 2 = 7075 locations for our cohort subjects. We select these locations from 
the full list of available 7375 EPA monitoring sites for convenience so that we do not have to 


recalculate the geographical covariates. We note that in Section 4H, we used the list of EPA 
AQS sites without missing data, but this does not affect the results of this analysis. We also 
note that we intentionally make the number of monitor locations small (300), compared to 
the number of subject locations in our simulated cohort data (7075), as this is usually the 
case in real cohort studies where sizes of cohorts are much larger than the number of the 
monitoring sites. We generate the true exposure surface measurements at Ni -|- N 2 locations 


using the methods described in Section 4H For shortness sake, however, we only consider 
the second scenario where we made most of the pollutants spatially unpredictable. From 


Section 4.1, we know that the predictive PCA results in principal scores that are predictable 
better compared to the scores from traditional PCA. Our primary goal here is to see how this 
affects the estimated health effects in the health analysis. After assuming that our simulated 
subjects live at the selected N 2 addresses, for each subject, we generate 5 subject specific 
covariates. These are A = age, R = race, I = income, E = education, and S = smoking 
status. We assume that age is a uniform continues variable between 30 and 80; and race. 
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income, education and smoking status are all categorical variables with 4 categories, with 
realistic proportions. We assume that we are interested in 2 health endpoints, Yi and Y 2 . 
For example, the hrst endpoint Yi could be a cardiovascular health endpoint; the second 
endpoint Y 2 could be a respiratory endpoint. In the true health model, we use the following 
to generate the health endpoints: 


Yi = 0.35 * Pi + 0.5 * A + 2 * R — 0.5 *1 — 1*E + S + Error 


( 11 ) 


1-2 = 0.6 * P 5 + 0.6 * Pg + 0.5 * A + 2 * R - 0.5 * I - 1 * E + S + Error, (12) 

where we assume the error terms are normally distributed with standard deviation equal 
to 10. Here, Pi, P 5 and Pg are the simulated pollutants, and age, race, income, education, 
smoking status are denoted by A, R, I, E, and S respectively. 

Finally, assuming that the true pollutant measurements are only available at the monitor 
locations, we want to study the effects of multi-pollutant exposure on Yi and Y 2 using the 
traditional and predictive (sparse) PCA. When we apply PCA approaches to simulated 
monitoring data, we only look at the hrst 3 principal components again. As noted before, 
with 3 components, we are able to explain approximately 70% of the original variability in the 
data. A more thorough approaches and strategies exist to select the number of components to 
consider in the analysis. Selecting the number of components using these approaches would 
be worthwhile but is beyond the scope of this paper. For each component, we choose the 
penalty parameters to maximize predictability of the pollutants. We see from the previous 
section that this yields scores that are predictable and not highly correlated compared to 
other methods to choosing the penalty parameters in both traditional and predictive PCA. 
Once predictions of the principal scores are obtained, we £t a regression model with principal 
scores as main predictors. In this model, we also include all the subject-specihc covariates. 
In fitting this model, we do not expect to recover the exact true parameters and variables 


used to generate the data in the true model in Equations H and In contrast to variable 
selection methods, both the predictive and traditional PCA are unsupervised methods to 
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extract mixtures of pollutants that occur together. When we study the effect of these 
mixtures on our health endpoints using predictions of the scores, we want to see if the the 
scores are signihcantly associated with the endpoints and if these mixtures contain the true 
“bad actors” used in generating these data. 

Results from one run of this analysis are given in Tables and First, from Table 
we see how well we can predict the principal scores from traditional and predictive PCA at 
subject locations. Here, to calculate these R^s, we use the predicted and the true values of 
the scores. The results of this table are consistent with what we see in Section 14.11 In this 
table, we also see the interpretations of the principal scores derived from the sparse loadings. 
For example, the hrst principal score is mostly a mixture of pollutants 1, 2, 4, 9, and 14. 

In Table we present the results of the health analysis. In Table for Yi, at the 
signihcance level of 0.05, we see that PC2 from predictive PCA is associated with Vi. From 
Table 1^ we see that an important part of this mixture is the pollutant 1, which we know is 
the true predictor of Yi in Equation[T^ Here, we also see that the PCI from traditional PCA 
(which also contains the pollutant 1) is marginally signihcant. For ¥ 2 , for predictive PCA, 
we see that the PCS is associated with Yi. From Table this is a mixture that contains 
the pollutant 8 which was used in Equation We see that the PC2 (which also contains 
the pollutant 5) is marginally signihcant. The pollutant 5 is also used in the true model in 
Equation For ¥ 2 , for traditional PCA, we see that the PCS is associated with Yi. The 
PCS here is a mixture that contains the pollutant 8. From this example, it is clear that 
the predictive PCA outperforms the traditional PCA when the goal is to study the effects 
of principal mixtures on the health endpoints. In our example, we observe this for both 
Yi, where the signal for the score from predictive PCA was stronger, and for ¥ 2 , where the 
traditional PCA missed the did not capture the Pollutant 5. We note that when we ran the 
simulated health analysis multiple times, the predictive PCA preformed better consistently. 
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Table 4: Interpretation of simulated principal scores 


PCI = 0.6) Mostly a mixture of pollutants 1, 2, 4, 9, 14 

Traditional 

PC2 iR^ = 0.6) Mostly a mixture of pollutants 1, 7, 11, 12, 13, 15, 17 

sPCA 

PCS {R^ = 0.6) Mostly a mixture of pollutants 6, 7, 8, 10, 


PCI {R^ = 0.6) Mostly a mixture of pollutants 1, 2, 4, 9, 14 

Predictive sPCA PC2 {R^ = 0.6) Mostly a mixture of pollutants 1, 2, 4, 5, 9, 14, 17 

PCS {R^ = 0.6) Mostly a mixture of pollutants 6, 7, 8, 10 


Table 5: Inference for simulated health model coefficient 


Traditional sPCA Predictive sPCA 




beta 

pval 

beta 

pval 


PCI 

-0.29 

0.08 

-0.01 

0.89 

Endpoint 1 

PC2 

-0.04 

0.66 

2.07 

0.04 


PCS 

-0.02 

0.80 

0.18 

0.86 


PCI 

-0.27 

0.11 

-0.36 

0.72 

Endpoint 2 

PC2 

-0.06 

0.53 

1.83 

0.07 


PCS 

-0.22 

0.02 

2.24 

0.03 
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Figure 1: Scatterplots of the predictions of the scores without a penalty. These are obtained 
using cross-validation and UK 


PC1 : R2 = 0.88 


Traditional PCA 

PC1 : R2 = 0.58 


PC1 : R2 = 0.45 



-10 -5 0 5 

Observed Scores 



-50 510 

Observed Scores 



-4 -2 0 2 4 

Observed Scores 


PCI : R2 = 0.89 



-10 -5 0 5 

Observed Scores 


Predictive PCA 

PCI : R2 = 0.59 



-50 510 

Observed Scores 


PCI : R2 = 0.68 



-4 -2 0 2 4 

Observed Scores 


5. APPLICATION TO MULTI-POLLUTANT MONITORING DATA 
We now apply traditional and predictive (sparse) PCA to the multi-pollutant monitoring 
data described in Section As an exploratory analysis, in Section Table [T| we observed 
that while some pollutants are highly predictable, there are others that have poor cross- 
validated R^. This suggests that PC scores obtained from projecting the multi-pollutant 
data onto traditional PC loadings may not have good spatial predictability. 

First, we apply traditional and predictive PCA without a penalty. Figure [T] shows scat¬ 
terplots obtained from 10-fold cross-validation. From this hgure, we conclude that the cross- 
validated for the PCI and PC2 scores from the two methods are very similar. This is 
also clear from looking at the bar plot of the corresponding loadings in Figure where we 
see that loadings from traditional and predictive PCA for the hrst two principal component 








Figure 2: The principal component loadings from two PCA approaches without penalty 
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Comp. 2 
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scores are close to each other. We note, however, that predictive PCA increases the pre¬ 
dictability of the PCS scores from 0.45 to 0.68. From Fignre|^ where pollntants are ordered 
from the smallest individnal B? to the highest, we see that the different PCA methods have 
different loadings for PCS. 

Now we apply both traditional and predictive PCA with a penalty to indnce sparseness. 
A snmmary of the resnlts from this analysis is given in Table If we compare this table 
to Tables and we notice that Table contains information abont the exact correlations 
between the principal scores, whereas in Tables]^ andwe only present the average absolute 
correlations between the scores. We do this because in simulated examples, the definition of 
the order of the principal components is different. In other words, the role of PCI, PC2 and 
PCS changes from simulation to simulation making it confusing to interpret the meanings 
of the of the individual correlations. In Table sparseness is again defined as the fraction 
of zeros in the PC loadings vector. Sparseness is equal to zero for non-sparse PCA for both 
traditional and predictive PCA, as can be seen in the hrst part of Table When the penalty 
is chosen to maximize predictability of the scores, we see that the cross-validated i?^’s for all 
PC scores are high for both methods. In sparse predictive PCA, these are above 0.90. 
However, we note that both methods result in scores that are highly correlated. 

When the penalty parameter is chosen to maximize predictability of the pollutants, Fig- 
ure|^and Table|^show that while the PCI and PC2 scores from the traditional and predictive 
versions of sparse PCA have similar i?^’s, the predictability of the PCS score from the predic¬ 
tive method is higher than from the traditional method. The bar plot of the corresponding 
principal loadings for the predictive and traditional sparse PCA are given in Figure Here, 
we see that the noticeable differences in the loadings are again in PCS. As expected, we 
also notice that the loadings now are sparse. From these, it is also clear that the predictive 
sparse PCA results in lower correlations between scores when compared to the traditional 
method, which is consistent with the conclusions obtained in simulated examples in Section 
1^ Additionally, we see that the traditional PCA has higher sparseness compared to the 
predictive PCA in Table However, we do not believe that is something that is systematic 
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Table 6: Summary of results. The is obtained using UK. Sparseness is the fraction 
of zero-valued. Here, “Max.Scores” means that the penalty was selected to maximize the 
predictability of the principal scores; and “Max.Pollutants” means that the penalty was 
selected to maximize the predictability of the pollutants loadings 


Predictability of Abs. Correlation 

Scores(i?^) _ (Average) _Sparseness (%) 






PCI 

PC2 

PC3 

PClvsPC2 

PClvsPC3 

PC2vsPC3 


o 

(-j 

0) 

Trad. 

PCA 

0.88 

0.58 

0.45 

0.00 

0.00 

0.00 

0.00% 

S 

Pred. 

PCA 

0.89 

0.59 

0.67 

-0.05 

0.03 

-0.04 

0.00% 



Max. 

C/D 

0) 

Trad. 

PCA 

0.92 

0.88 

0.64 

0.96 

-0.14 

-0.08 

54.39% 

4-2 

O 

o 

zn 

Pred. 

PCA 

0.93 

0.90 

0.89 

0.97 

0.97 

0.99 

43.86% 

0) 
























S 


C/D 

"S 

05 

4^ 

Trad. 

PCA 

0.88 

0.59 

0.57 

0.17 

-0.15 

-0.19 

47.37% 




Pred. 

PCA 

0.89 

0.59 

0.66 

-0.04 

0.01 

0.01 

17.54% 
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between the methods as we did not observe the phenomena that sparseness was snbstan- 
tially different for two methods in simnlated examples. Looking at Fignre|^ we see that the 
difference in sparseness of the loadings in Table is mostly becanse of the differences in the 
third principal loadings. We speculate that while the loadings values for positive weights are 
similar, the predictive PCA is adding elements to the mixture with negative loadings values 
to increase the overall predictability of the hnal principal score. This elements are given 
a loading values equal to zero in the traditional PCA. In Figure we present a heatmap 
of the predictions of the scores from predictive sparse PCA. These heatmaps show clearly 
how different mixtures derived from different scores vary spatially When we apply universal 
kriging model to predict the sparse principal scores in Table the estimated range for all 
the scares was around 600km for both traditional and predictive sparse PCA. The estimate 
of the partial sill and nugget effect for the first principal score was 40 and 0.92 for traditional 
PCA, and 37 and 0.82 for predictive PCA; for the second principal score, 79 and 0.76 for 
traditional PCA, and 73 and 0.81 for predictive PCA; and for the third principal score, 69 
and 0.21 for traditional PCA, and 65 and 0.19 for predictive PCA respectively. 

6. DISCUSSION 

In this paper, we describe a novel predictive (sparse) PCA approach that can be used to 
obtain exposure estimates in air pollution cohort studies. The approach can be used with 
or without a penalty. In contrast to traditional PCA, we demonstrate that predictive PCA 
may lead to principal scores that can be predicted well using geographical covariates and 
splines in the exposure prediction stage. We have also shown that penalty parameters in 
predictive sparse PCA can be selected so that the correlations between the resulting scores 
are small. Based on the simulated examples and data analysis, we observed that when 
the penalty parameter is selected to maximize predictability of the principal scores, the 
resulting loadings are nearly orthogonal and the scores are almost uncorrelated. In our data 
analysis, we apply the traditional and predictive (sparse) PCA approach to multi-pollutant 
data from EPA’s regulatory monitors from 2010. In Table [T| we present a summary of how 
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Figure 3: The principal component loadings from two PCA approaches with penalty 
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Predicted Scores Predicted Scores 


Figure 4: Scatterplots of the predictions of the scores with a penalty to maximize predictabil¬ 
ity of the pollutants 
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Figure 5: Heatmaps based on the predictions of the scores from predictive sparse PCA 
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individual pollutants can be predicted using the commonly used universal kriging model 
with GIS covariates. In Table and Figures [T] and we show how the different mixtures 
obtained from the principal scores can be predicted. In Figures and we present the 
component loadings of the principal scores obtained using the predictive and traditional 
(sparse) PGA. Finally, Figure provides further insights on how the principal component 
scores vary spatially across the U.S. 

In predictive (sparse) PGA, we do not use the health data when applying the algorithm 
to multi-pollutant data. Therefore, one may think of our predictive PGA as an unsupervised 
dimension reduction approach. On the other hand, unlike traditional (sparse) PGA, the 
approach is “supervised” by geographical data to increase predictability of the PG scores. 
Alternatively, one can potentially develop a dimension reduction approach that is fully super¬ 
vised by both geographical and health data. This, for example, can be done using a Bayesian 
framework by introducing lower dimensional latent variables which govern the joint distri¬ 
bution between pollutants and the health variable. In joint models, in other words, the 
first stage exposure model is a part of a unified model for the exposure and health data 


(cf. Sinha et al. (2010) in nutritional epidemiology and Gryparis et al. (2009) in air pollu¬ 
tion epidemiology). We argue against such a joint approach for four reasons. First, using 
a two-step approach in air pollution studies offers a way of “cutting feedback” that may 
drastically influence estimation and prediction in the exposure model in unexpected ways, 
especially when the prediction model is misspecified. Second, when simpler two-step ap¬ 
proaches are used, performing a sensitivity analysis using different exposure models in the 
hrst step can be done easily to understand the effects of these models. Third, using two-step 
approaches allows use of the same predicted exposures across studies with multiple health 
endpoints which makes such approaches more practical and scientifically desirable. Fourth, 
an inherent disadvantage of a unihed approach is that it requires specifying high dimensional 
joint distributions (and the associated priors in a Bayesian framework). By simple virtue of 
the high dimensionality, this task is computationally much more expensive than a two-stage 
approach, particular in the context of air pollution studies where there are less than 1000 
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monitors that are used in the prediction step and a number of subjects that are in the 10s 
of thousands in the health analysis step. For more details on discussion between unified and 


two-step approaches, readers are referred to Szpiro and Paciorek (2013). 

In the literature, when both response variables (as our multi-pollutant data) and co¬ 
variates (as our GIS covariates and thin-plate splines) are multivariate, methods known as 
two-block partial least-squares (two-block PLS) can be used for analyzing the covariance be¬ 


tween the two sets of variables (cf. Rohlf and Corti, 2000; Sampson et ah, 1989 Streissguth 


et ah, 1993). These methods are similar to the well-known method of canonical correlation 


analysis (Jackson, 2005). Even though two-block PLS is related to our PGA approach, in 
predictive PGA, our goal is quite distinct: we want to improve predictability of the PG 
scores using the same GIS covariates and splines and to explain most of the variability in the 
multi-pollutant data. In contrast to our goal, in two-block PLS and canonical correlation 
analysis, the two sets of variables are treated symmetrically in an attempt to find possible re¬ 
lationships between the underlying latent variables derived from the sets. These approaches 
are useful for exploring the combinations of variables in the two sets that account most for 
whatever covariation there is between them. 

In this paper, we considered predictive PGA both with and without a lasso-type penalty 
on the loading vectors to induce sparseness. In principle, we could also add additional penalty 
parameter to the a coefficients in Equation ([^, which could be particularly useful for in¬ 
corporating richer spatial prediction models by allowing for a way to eliminate insignificant 
geographical covariates. This may, however, substantially slow down our iterative optimiza¬ 
tion algorithm described in Section |3.1 As future research, a related idea is to exploit the 


link between random effects and penalization (cf. Ruppert et ah, 2003 Verbyla et ah, 1999) 
to develop an algorithm that uses a random effects model for the a. Another promising 
venue for future research is to develop a predictive dimension reduction approach that hnds 
all principal loadings and scores in one step. In contrast to our sequential approach, this 
would facilitate borrowing information between scores in order to potentially further improve 
predictability. 
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When variability of high dimensional and correlated variables is of interest in the context 
of redncing the dimensionality of the data, factor analysis is another widely nsed statistical 


techniqne (cf. Harman, 1967). In factor analysis, the original variables are defined as linear 
combinations of the nnobserved variables called factors with error terms nsing a statistical 
model, whereas in PCA, the components scores are calcnlated as linear combinations of the 
original variables. Therefore, the main goal in factor analysis is to explain the correlations 
between the variables with these lower dimensional factors to nnderstand what constructs 
underlie the data. In PCA, our goal is to extract the lower dimensional scores to reduce the 
dimensionality of the correlated observed variables to a smaller set of important independent 
composite variables in an interpretable way for subsequent use in the health analysis. 

In our approach, we achieve sparsity by adding a penalty term to the optimization prob¬ 
lem. This allows for an easy way to incorporate the CIS covariates into a new optimization 
problem in the predictive PCA. In both PCA and factor analysis, in order to make the 
interpretation of the scores or factors easier, one can also achieve sparsity via rotation of 
the loadings. There are two main types of rotations that can be used: orthogonal, where 
orthogonality of the loadings is preserved and oblique, when the new loadings are not neces¬ 
sarily required to be orthogonal to each other. The most commonly used technique based on 


rotation of the loadings is called varimax rotation (cf. Kaiser, 1958). Similar to sparse PCA, 
varimax rotation can result in scores or factors that have a small number of large loadings 
and a large number of zero loadings. Potentially, we note that it may also be possible to 
develop an alternative predictive sparse PCA approach based on rotation techniques. 
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